-
Couldn't load subscription status.
- Fork 145
bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated. #9807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: bpf-net_base
Are you sure you want to change the base?
Conversation
|
Upstream branch: 2dfd8b8 |
|
Upstream branch: 2dfd8b8 |
e0fa754 to
3630e56
Compare
5b09f8e to
21102ff
Compare
|
Upstream branch: 55d5a51 |
3630e56 to
13545ef
Compare
21102ff to
2017bdf
Compare
|
Upstream branch: 5e3fee3 |
13545ef to
0194467
Compare
2017bdf to
b0e0e9a
Compare
|
Upstream branch: 5e3fee3 |
0194467 to
ae3f177
Compare
b0e0e9a to
bf842fb
Compare
|
Upstream branch: 5e3fee3 |
ae3f177 to
50052b9
Compare
bf842fb to
731c728
Compare
|
Upstream branch: 5e3fee3 |
50052b9 to
e9a373c
Compare
731c728 to
e8ea10a
Compare
|
Upstream branch: 5e3fee3 |
e9a373c to
da68fcb
Compare
7999b10 to
ee50140
Compare
3c85f98 to
ee31d28
Compare
If memcg is enabled, accept() acquires lock_sock() twice for each new TCP/MPTCP socket in inet_csk_accept() and __inet_accept(). Let's move memcg operations from inet_csk_accept() to __inet_accept(). Note that SCTP somehow allocates a new socket by sk_alloc() in sk->sk_prot->accept() and clones fields manually, instead of using sk_clone_lock(). mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(), so I added the protocol check in __inet_accept(), but this can be removed once SCTP uses sk_clone_lock(). Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Shakeel Butt <[email protected]>
…ing. Some protocols (e.g., TCP, UDP) implement memory accounting for socket buffers and charge memory to per-protocol global counters pointed to by sk->sk_proto->memory_allocated. If a socket has sk->sk_memcg, this memory is also charged to memcg as "sock" in memory.stat. We do not need to pay costs for two orthogonal memory accounting mechanisms. A microbenchmark result is in the subsequent bpf patch. Let's decouple sockets under memcg from the global per-protocol memory accounting if mem_cgroup_sk_exclusive() returns true. Note that this does NOT disable memcg, but rather the per-protocol one. mem_cgroup_sk_exclusive() starts to return true in the following patches, and then, the per-protocol memory accounting will be skipped. In __inet_accept(), we need to reclaim counts that are already charged for child sockets because we do not allocate sk->sk_memcg until accept(). trace_sock_exceed_buf_limit() will always show 0 as accounted for the memcg-exclusive sockets, but this can be obtained in memory.stat. Signed-off-by: Kuniyuki Iwashima <[email protected]> Nacked-by: Johannes Weiner <[email protected]> Reviewed-by: Shakeel Butt <[email protected]>
If net.core.memcg_exclusive is 1 when sk->sk_memcg is allocated, the socket is flagged with SK_MEMCG_EXCLUSIVE internally and skips the global per-protocol memory accounting. OTOH, for accept()ed child sockets, this flag is inherited from the listening socket in sk_clone_lock() and set in __inet_accept(). This is to preserve the decision by BPF which will be supported later. Given sk->sk_memcg can be accessed in the fast path, it would be preferable to place the flag field in the same cache line as sk->sk_memcg. However, struct sock does not have such a 1-byte hole. Let's store the flag in the lowest bit of sk->sk_memcg and check it in mem_cgroup_sk_exclusive(). Tested with a script that creates local socket pairs and send()s a bunch of data without recv()ing. Setup: # mkdir /sys/fs/cgroup/test # echo $$ >> /sys/fs/cgroup/test/cgroup.procs # sysctl -q net.ipv4.tcp_mem="1000 1000 1000" Without net.core.memcg_exclusive, charged to memcg & tcp_mem: # prlimit -n=524288:524288 bash -c "python3 pressure.py" & # cat /sys/fs/cgroup/test/memory.stat | grep sock sock 22642688 <-------------------------------------- charged to memcg # cat /proc/net/sockstat| grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 5376 <-- charged to tcp_mem # ss -tn | head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53188 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:49972 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53868 ESTAB 2000 0 127.0.0.1:34479 127.0.0.1:53554 # nstat | grep Pressure || echo no pressure TcpExtTCPMemoryPressures 1 0.0 With net.core.memcg_exclusive=1, only charged to memcg: # sysctl -q net.core.memcg_exclusive=1 # prlimit -n=524288:524288 bash -c "python3 pressure.py" & # cat /sys/fs/cgroup/test/memory.stat | grep sock sock 2757468160 <------------------------------------ charged to memcg # cat /proc/net/sockstat | grep TCP TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <- NOT charged to tcp_mem # ss -tn | head -n 5 State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:49026 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:45630 ESTAB 110000 0 127.0.0.1:36019 127.0.0.1:44870 ESTAB 111000 0 127.0.0.1:36019 127.0.0.1:45274 # nstat | grep Pressure || echo no pressure no pressure Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Shakeel Butt <[email protected]>
We will support flagging SK_MEMCG_EXCLUSIVE via bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook. BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk() that passes a pointer to struct sock to the bpf prog as void *ctx. But there are no bpf_func_proto for bpf_setsockopt() that receives the ctx as a pointer to struct sock. Also, bpf_getsockopt() will be necessary for a cgroup with multiple bpf progs running. Let's add new bpf_setsockopt() and bpf_getsockopt() variants for BPF_CGROUP_INET_SOCK_CREATE. Note that inet_create() is not under lock_sock() and has the same semantics with bpf_lsm_unlocked_sockopt_hooks. Signed-off-by: Kuniyuki Iwashima <[email protected]>
If a socket has sk->sk_memcg with SK_MEMCG_EXCLUSIVE, it is decoupled
from the global protocol memory accounting.
This is controlled by net.core.memcg_exclusive sysctl, but it lacks
flexibility.
Let's support flagging (and clearing) SK_MEMCG_EXCLUSIVE via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.
u32 flags = SK_BPF_MEMCG_EXCLUSIVE;
bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
&flags, sizeof(flags));
As with net.core.memcg_exclusive, this is inherited to child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not supported on other hooks for some reasons:
1. UDP charges memory under sk->sk_receive_queue.lock instead
of lock_sock()
2. For TCP child sockets, memory accounting is adjusted only in
__inet_accept() which sk->sk_memcg allocation is deferred to
3. Modifying the flag after skb is charged to sk requires such
adjustment during bpf_setsockopt() and complicates the logic
unnecessarily
We can support other hooks later if a real use case justifies that.
Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with SK_MEMCG_EXCLUSIVE. This will be more
visible under tcp_mem pressure.
# bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
kretprobe:__sk_mem_raise_allocated /@start[tid]/
{ @EnD[tid] = nsecs - @start[tid]; @times = hist(@EnD[tid]); delete(@start[tid]); }'
# tcp_stream -6 -F 1000 -N -T 256
Without bpf prog:
[128, 256) 3846 | |
[256, 512) 1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 198207 |@@@@@@ |
[2K, 4K) 31199 |@ |
With bpf prog in the next patch:
(must be attached before tcp_stream)
# bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
# bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg
[128, 256) 6413 | |
[256, 512) 1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 117031 |@@@@ |
[2K, 4K) 11773 | |
Signed-off-by: Kuniyuki Iwashima <[email protected]>
The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_MEMCG_EXCLUSIVE, which can be turned on by
net.core.memcg_exclusive or bpf_setsockopt(SK_BPF_MEMCG_EXCLUSIVE).
1. Create socket pairs
2. Send NR_PAGES (32) of data (TCP consumes around 35 pages,
and UDP consuems 66 pages due to skb overhead)
3. Read memory_allocated from sk->sk_prot->memory_allocated and
sk->sk_prot->memory_per_cpu_fw_alloc
4. Check if unread data is charged to memory_allocated
If SK_MEMCG_EXCLUSIVE is set, memory_allocated should not be
changed, but we allow a small error (up to 10 pages) in case
other processes on the host use some amounts of TCP/UDP memory.
The amount of allocated pages are buffered to per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc up to +/- net.core.mem_pcpu_rsv
before reported to {tcp,udp}_memory_allocated.
At 3., memory_allocated is calculated from the 2 variables at
fentry of socket create function.
We drain the receive queue only for UDP before close() because UDP
recv queue is destroyed after RCU grace period. When I printed
memory_allocated, UDP exclusive cases sometimes saw the non-exclusive
case's leftover, but it's still in the small error range (<10 pages).
bpf_trace_printk: memory_allocated: 0 <-- TCP non-exclusive
bpf_trace_printk: memory_allocated: 35
bpf_trace_printk: memory_allocated: 0 <-- TCP w/ sysctl
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0 <-- TCP w/ bpf
bpf_trace_printk: memory_allocated: 0
bpf_trace_printk: memory_allocated: 0 <-- UDP non-exclusive
bpf_trace_printk: memory_allocated: 66
bpf_trace_printk: memory_allocated: 2 <-- UDP w/ sysctl (2 pages leftover)
bpf_trace_printk: memory_allocated: 2
bpf_trace_printk: memory_allocated: 2 <-- UDP w/ bpf (2 pages leftover)
bpf_trace_printk: memory_allocated: 2
We prefer finishing tests faster than oversleeping for call_rcu()
+ sk_destruct().
The test completes within 2s on QEMU (64 CPUs) w/ KVM.
# time ./test_progs -t sk_memcg
#370/1 sk_memcg/TCP :OK
#370/2 sk_memcg/UDP :OK
#370/3 sk_memcg/TCPv6:OK
#370/4 sk_memcg/UDPv6:OK
#370 sk_memcg:OK
Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED
real 0m1.609s
user 0m0.167s
sys 0m0.461s
Signed-off-by: Kuniyuki Iwashima <[email protected]>
|
Upstream branch: cbf33b8 |
ee50140 to
70c7ca2
Compare
5835fd6 to
d4d3232
Compare
254f31d to
e61c52a
Compare
c0aa5e7 to
d746c5c
Compare
Pull request for series with
subject: bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
version: 10
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1004438